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Abstract — In previous studies, a variety of unsupervised anom- 
aly detection techniques for anomaly detection were applied to 
SSME (Space Shuttle Main Engine) data. The observed results 
indicated that the identification of certain anomalies were specific 
to the algorithmic method under consideration. This is the reason 
why one of the follow-on goals of these previous investigations 
was to build an architecture to support the best capabilities of 
all algorithms. We appeal to that goal here by investigating a 
cascade, serial architecture for the best performing and most 
suitable candidates from previous studies. 

As a precursor to a formal ROC (Receiver Operating Char- 
acteristic) curve analysis for validation of resulting anomaly 
detection algorithms, our primary focus here is to investigate 
the model fidelity as measured by variants of the AIC (Akaike 
Information Criterion) for state-space based models. We show 
that placing constraints on a state-space model during or after the 
training of the model introduces a modest level of suboptimality. 
Furthermore, we compare the fidelity of all candidate models 
including those embodying the cascade, serial architecture. We 
make recommendations on the most suitable candidates for 
application to subsequent anomaly detection studies as measured 
by AlC-based criteria. 

I. Introduction 

This paper is a continuation of two previous studies [18] 
- [19], in which various unsupervised anomaly detection 
algorithms were applied to SSME data. The SSME (Space 
Shuttle Main Engine) is a complex re-usable liquid propulsion 
system, and is outfitted with a comprehensive array of sensors 
(vibration, facility, and controller measurements). There are 
three SSME’s and two SRB’s (Solid Rocket Boosters) used to 
support the launch of the space shuttle for ongoing missions 
prior to its retirement. Although the shuttle is due to be retired 
in 2010, the investigation of anomaly detection algorithms 
applied to this SSME dataset can be justified for a variety 
of reasons. 

The acquisition of authentic operational and supporting truth 
data from which to perform rigorous statistical analyses is 
a rare commodity from the perspective of propulsion for 
space applications due to their design for high reliability 
and relatively low failure rates. While the SSME dataset 
described herein does not by any means represent a consistent, 
comprehensive dataset from which to generate a statistically 
significant analysis, there are certainly ways to interpret and 
analyze the data in a manner that may serve to support 


continued achievement of ISHM (Integrated System Health 
Management) goals. 

Furthermore, this dataset can act as a baseline for the 
development of algorithms related to future generations of 
spaceflight, i.e. the Ares I and Ares I-X launch [29]. The 
methods investigated and developed from this dataset are 
also certainly more generally applicable to a broader class of 
IVHM (Integrated Vehicle Health Management) application 
platforms. One example would be the application of derived 
techniques to civil aeronautics platforms, and more fundamen- 
tally to aeronautics research. As such, even though the dataset 
is application-specific, our intent is to demonstrate the utility 
of our findings from a much broader perspective. 

One of our primary goals in this paper is to introduce the 
notion of model fidelity, a topic previously lacking in other 
analyses. The suite of algorithms under consideration for our 
purposes here is a smaller subset of ones used in previous 
analyses. Many of the algorithms used previously were fairly 
mature and had been successfully deployed onboard critical 
application platforms. Others operated at lower TRL (technol- 
ogy readiness level), but are nonetheless viable candidates for 
consideration. The more mature algorithms are IMS (Inductive 
Monitoring System) [13], GritBot, and Orca [1], [28]. The 
research-stage algorithms are SVM (Support Vector Machines) 
[4], various implementations of the GMM (Gaussian Mixture 
Model), and an LDS (Linear Dynamic System). 

First we will establish the requirements making an algorithm 
a suitable candidate for inclusion in the serial architecture in 
Sec. II. An appropriate selection is made based upon these 
requirements. We will then provide a detailed discussion of 
the state-space model framework in Sec. III. This section 
will also cover various initialization and learning strategies 
under consideration in Sec. Ill- A. The model likelihood and 
its relationship to the Kalman filter updates is detailed in Sec. 
III-B, and the method for assessing model fidelity is covered 
in Sec. III-C. Finally, we will present the results in Sec. IV 
and provide concluding remarks in Sec. V. 

II. Requirements 

The requirements that would make an algorithm a viable 
candidate for our purposes here are provided in the following 
list. 



TABLE II 

Training/Validation Breakdown 


Data Sources 

Training 

Nominal 

Validation 

Nominal Potential Anomalies 


STS-77 (#1) 

STS- 103 (#2) 

STS-77 (#2) 


STS-78 (#1) 

STS-103 (#3) 

STS-91 (#1) 


STS-78 (#2) 

STS- 106 (#1) 

STS-93 (#1) 


STS-78 (#3) 

STS- 106 (#2) 

STS-93 (#3) 


A10851 

A10852 

A10853 


A20726 

A20750 

A20619 


1) The method is conducive to temporal analysis (i.e. 
a likelihood, probability, or other relevant score-based 
metric can be generated for each point in time to allow 
for the construction of an ROC curve and subsequent 
design of an alarm system). 

2) The method provides an informative composite non- 
zero score for a multivariate time series dataset that 
is available during the training phase, and will retain 
inherent dynamic structure. 

3) The scores can be compared to some relevant prede- 
termined threshold, whether derived statistically, exper- 
imentally, or is inherent to the algorithm itself. By 
constructing a candidate level-crossing event involving 
the output of a linear dynamic state-space system and a 
relevant failure threshold, we can mitigate false alarms 
by invoking the principle of optimal alarm as suggested 
in [22]. 

The first of these requirements, Req. 1 addresses the lack 
of sufficient data per flight cycle that has been categorized 
as containing anomalies, faults, or failures. This dearth of 
anomalous truth data inhibits the ability to generate an ROC 
curve with any reasonable level of statistical significance. 
However, using the same dataset we can address this lack of 
data by constructing an ROC curve based upon a temporal 
labelling of the truth data in lieu of per flight cycle. The time 
and severity of each anomaly corresponding to our dataset 
is shown in Table I. Additionally, the descriptions of the 
anomalies and their functional categorizations are provided. 

Due to the availability of the temporal information for all 
anomalies, we can construct a statistically significant ROC 
curve based upon each time point rather than each flight 
cycle. Table II contains the corresponding meta-data for each 
documented anomaly. This table identifies the datasets of 
interest and categorizes them according to their source. They 
are also categorized according to which flights are used to train 
models in this study, and which will be used for validation 
in a subsequent study to be presented in a sequel paper. The 
validation data is partitioned into flights that contain anomalies 
and those that are nominal. As determined in [19], two of the 
flights that were originally categorized as nominal required 
reclassification due to mild anomalies that had not previously 
been labelled as such. This is evident in the anomalous flights 
that have been listed with the time and severity of each 
anomaly shown in Table I. 


w(t) 



Fig. 1. Closed-Loop Control System Block Diagram 


Requirement 2 addresses the need for retaining the inherent 
dynamic structure of the data as well as any potential failure 
or anomalous signatures. In order to construct a model that 
is dynamically informative we require that the training data is 
assigned a composite score that is non-zero for all time points. 
The data must also not be randomized in order to preserve its 
dynamic integrity. This dynamic integrity is important in order 
to ensure development of a model that can learn hidden causal 
relationships during training. 

Finally, requirement 3 addresses the target anomaly detec- 
tion algorithm based upon use of the state-space models that 
we will investigate in subsequent studies. The incorporation 
of a predefined threshold into the design of a state-space 
based alarm system is particularly useful due to the inherent 
capability to mitigate false alarms, as discussed in [22] and 
[17], Therefore, we will use a state-space formulation by 
default, under the assumption that the structure of the model 
is a linear dynamic system driven by Gaussian noise, and has 
a univariate output. 

One of the primary justifications for requirement 3 is related 
to the introduction of an architecture to support the best 
capabilities of all algorithms. A cascade, serial architecture 
as shown in Fig. 2 promotes synergism among the best per- 
forming and most suitable candidates from previous studies. 

The idea behind using control system error as the sole 
indicator for anomaly detection (shown in Fig. 1 as e(t)) lies in 
the fact that the SSME throttle control system was most likely 
designed with both reference command following and distur- 
bance rejection in mind. As such, when large disturbances 
influence the plant, V, to the extent that the control system 
cannot reject them expediently, this may be indicative of a 
significant event which is cause for diagnostic investigation. 

In previous studies, it was found that alarm systems devel- 
oped on this concept performed moderately well for specific 
types of anomalies, both in terms of accuracy and time to 
detection. Another reason for using control system error as 
the sole indicator for anomaly detection was the possibility 
of constructing an optimal alarm system based upon a linear 
dynamic system trained by the control system error. As such, 
the principles of optimal alarm can be invoked in order to 
mitigate false alarms. 

However, it was found in [19] that when considering the 
control system error as the basis for training a univariate 





TABLE I 

Characterization of Failures 


Failure Data 

Failure Type 

Time of Anomaly 

Severity 

STS-77 (#2) 

Anomalous Spike in Sensor Reading (Controller) 

74.42 sec 

Mild 

STS-91 (#1) 

Sensor Failure (Controller) 

32.76 sec 

Mild 

A10852 

Mixture Ratio Change (Controller) 

210 sec 

Mild 

STS- 103 (#3) 

Max Noise Failure (Vibration) 

38.1 sec 

Mild 

STS-93 (#1) 

Controller Failure (Controller) 

1 1.38 sec 

Moderate 

STS-93 (#3) 

Fuel Leak and Controller Failure (Controller) 

1 1 .62 sec 

Moderate to Severe 

A20619 

Knife Edge Seal Crack (Vibration) 

119 sec 

Moderate to Severe 

A10853 

Turbine Blade Failure (Vibration) 

130 sec 

Severe 


linear dynamic system, the performance was poor for con- 
troller failures. This was due to the total loss of power to 
the controller, resulting in sensor readings of zero for both 
commanded and actual throttle. Because control system error 
is defined as the difference between commanded and actual 
throttle, as shown in Fig. 1, a very clearly anomalous condition 
can be mistaken for an otherwise nominal value of zero. This 
provides further evidence for the use of an architecture that 
will detect anomalies of all types by incorporating multiple 
methods. 

It was also found in the same study [19] that overall 
accuracy and time to detect for SVM and Orca were better 
than for most other algorithms on average. As such, the idea 
behind the proposed cascade serial architecture shown in Fig. 
2 is to allow for a data reduction that will incorporate the 
characteristics of the algorithms with the best performance. 

Although Fig. 2 illustrates parallel branches, the serial 
portion of the architecture involves the lower branch in which 
a composite anomaly score is generated. After significant pre- 
processing is performed in order characterize nominal behav- 
ior, the anomaly score should contain no pathologically high 
values. This anomaly score can then be used as training data 
for the linear dynamic system. The upper branch corresponds 
to the control system error, which uses a much smaller fraction 
of the feature space than the lower branch (as indicated by the 
thicker line for the lower branch). Independent linear dynamic 
system models are generated from each branch for comparison 
only, and the common training data set is shown as the source 
for both methods solely for convenience. 

There are reasons other than pure performance documented 
in the previous study cited earlier [19] for eliminating certain 
algorithms as candidates for the serial architecture. These 
reasons are mainly related to inadmissability due to the lack 
of meeting the three requirements that have been set forth. 
All of the candidate algorithms with the exception of LDS 
can learn a model based upon a multivariate time series. In 
fact, the SVM algorithm can even generate a composite score 
that can be compared to a relevant predetermined threshold 
that is inherent to its theoretical construction (the margin to 
the hyperplane as measured by Euclidean distance). However, 
none of these techniques other than LDS can independently 
apply the principles of optimal alarm based upon a predefined 
level-crossing event. 

This cascade serial architecture (Fig. 2) is an attempt to 
utilize the best characteristics of algorithmic candidates that 



Fig. 2. Serial Architecture 


meet the requisite criteria. Essentially, SVM and Orca act to 
transform the training data from a multivariate to a univariate 
time series of the type required by LDS for training, via data 
reduction resulting in an anomaly score. It is therefore possible 
that other techniques such as unreleased future versions of 
IMS, or even PCA (principle components analysis) could also 
be used for this data reduction. 

The reason why standard algorithms may be used inter- 
changeably with other types of static maps that are fundamen- 
tally transformations and/or data reduction techniques is due 
to the fact that the training dataset itself is used both during 
the learning and monitoring phases. The LDS algorithm should 
not be subject to “inheriting” undesirable characteristics of the 
“parent” algorithm delivering its anomaly score as training 
data (i.e. SVM or Orca). This is due to the fact that no 
decisions made by the parent algorithm are incorporated into 
LDS validation. As such, the anomaly score should be viewed 
purely as a transformation of the training data. 

Prior to assignment of an anomaly score by these algo- 
rithms, the multivariate time series data is preprocessed by 
z-scoring and using a method called stationarization. Station- 
arization is a technique that has previously been used for 
SSME data analysis by Park et al. [25]. Stationarization is 
used in tandem with z-scoring to remove the effect of non- 
stationarities that arise in the resulting anomaly score as a 
result of varying operational modes, which may otherwise 
trigger spurious alarms. Z-scoring conditions the data so as 


to eliminate any bias introduced by inconsistencies in mea- 
surement units for various parameters. 

The datastream for either branch of the serial architecture 
should appear identical, as shown in Fig. 2. This can be 
enabled by ensuring that the qualitative characteristics of the 
relative frequency histogram of data for the input to the LDS 
learning block appear Gaussian. Alternative Gaussian trans- 
formation methods such as the one described in [3] are viable 
candidates for future study. However, for the purpose of this 
study, we will consider just the two candidate preprocessing 
methods previously described, under a number of different 
settling time scenarios to be discussed shortly. 

Unfortunately, for the SVM algorithm, the stationarization 
step may eliminate the inherent capability of the composite 
score to demonstrate a qualitative representation of the specific 
parameter with the most egregious behavior. The composite 
anomaly score is inherently a distance-based metric, i.e the dis- 
tance from the monitored point to the hyperplane acting as the 
decision boundary. However, if only z-scoring is performed, 
the inherent non-stationarity in the anomaly score will enable 
all operational regimes visited to present as a multimodal 
distribution. Such a distribution is clearly not amenable to 
the univariate Gaussian representation required for training an 
LDS. 

For Orca, the composite score is also a distance-based 
metric, however there is very little difference between the 
qualitative nature of the distributions when using both z- 
scoring and stationarizing as opposed to just using z-scoring 
for preprocessing. Due to the nature of the manner in which 
the distance calculations are performed, the distributions are 
also very skewed to the left. Neither distribution for the SVM 
or Orca scores will be centered at zero. For either distribution 
to possess a non-zero mean does not present an issue, because 
all of the methods to be described use a zero mean without 
loss of generality, and are used for mathematical convenience. 

Fig. 3 shows the relative frequency histogram of the control 
system error data resulting in the realization of a learned 
model shown to the right of the LDS block shown on top 
of Fig. 2. This is qualitatively a good fit to the Gaussian 
model, as supported by the superimposed fit of the Gaussian 
curve plotted as a function of relevant measured statistics. The 
empirical mean and variance prior to training are functions of 
LDS model parameters, and corresponds to the curve shown 
in red, while the curve shown in cyan is the fit after training. 
In Fig. 3 the distinction between these two curves is barely 
discernable, as they are nearly identical. 

Fig. 4 shows the relative frequency histogram for the anom- 
aly scores of both SVM (top) and Orca (bottom). It is apparent 
that these methods do not provide as good a fit to the data 
under the Gaussian distribution assumption as shown for the 
case of Fig. 3 for control system error. As such, one possibility 
which was explored was to increase the number of points 
removed due to transient behavior by allowing for an increase 
in the length of the recovery period following major throttling 
transients. 1 sec was documented as the settling time allowed 
for and used in previous studies [18], [19], We only consider 


Histogram of Control System Error 



Fig. 3. Empirical Distribution of Control System Error Training Data 


periods of steady-state behavior for the purposes of this and 
previous studies. Transient periods may be investigated for all 
subject algorithms in future work. 

It was found that in a completely qualitative sense, the best 
settling times to allow for the best Gaussian fit to the SVM and 
Orca anomaly scores were 5 sec and 8 sec, respectively. These 
settling times were found by investigating values ranging from 
1 sec to 10 sec. We will use more thorough quantitative 
means with which to assess the fit of the model and its 
Gaussian assumptions to the corresponding score-based data in 
subsequent sections. It is interesting to note that the histograms 
of anomaly scores for both SVM and Orca appear skewed, 
and are skewed in opposite directions to each other. This is 
most likely due to the manner in which the distance-based 
metrics were defined. Furthermore, for the SVM algorithm 
the Gaussian radial basis function (RBF) is used as the kernel 
operator. In future studies, we will attempt to exploit this fact 
to achieve a better qualitative fit of the data to the model, in 
addition to methods such as the one described in [3], 

Of the more mature algorithms, GritBot is a commercially 
available decision tree based algorithm. GritBot does not 
provide a score for each monitored data point, but instead 
provides a list of the top anomalous scores that are ranked 
according to their statistical significance. As such, it is not 
a suitable candidate for real-time monitoring and implemen- 
tation. IMS is an unsupervised machine learning algorithm 
that uses clustering to form a nominal region represented by 
the union of a finite number of hyper-rectangular clusters. By 
definition, IMS assigns a score of zero to all monitored points 
that fall within this nominal region. As such, the nominal 
training data cannot be used to train an LDS model since they 
will all have zero values by default. 

The GMM was the poorest performer in the JANNAF study 
from the perspective of accuracy. However, when implemented 
in its most exhaustive variant (one GMM per parameter), the 
time to detect was on average on par with Orca. The GMM 
is a simple technique to fit multimodal Gaussian distributions 
to data under an IID assumption. Because the GMM is geared 
for isolation of anomalies in this particular implementation, 
there is inherently no viable data reduction technique that 


SVM Anomaly Score using a 5 sec settling time 



Histogram of Composite Orca Scores using 8 sec settling time 
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Fig. 4. Empirical Distribution of SVM and Orca composite anomaly scores 
TABLE III 

Requirements met by Algorithmic Candidates 


model fidelity directly, both quantitatively and qualitatively. 
Ultimately, any such model will also affect its subsequent 
application to this anomaly detection algorithm. We will fully 
discuss the technical details of model fidelity in the subsequent 
section. 

III. Methodology 

We will investigate both quantitative and qualitative aspects 
of model fidelity, as discussed by [2] and [26], respectively. 
Our objective here is to apply reasonable judgements and 
metrics with which to assess and ultimately choose the model 
that best describes the data. Variants of the serial architecture 
shown in Fig. 2 will be explored. Our underlying assumption 
is that we can fit measured or transformed data to a model 
represented by a linear dynamic system driven by Gaussian 
noise. The state-space formulation is shown in Eqns. 1-2. 

Xfc-j-i = Ax fe + w fc (1) 

y k = Cx t + v k (2) 

where 

w fe ~ A/"(0, Q) 

Vk ~ U(0, R) 

x 0 ~ Af(> x ,Po) 

/4x = 

P 0 = £[(X 0 - /X x )(x 0 - /4 x ) T ] 


Requirement 

Orca 

IMS 

SVM 

GritBot 

GMM 

LDS 

1 

Yes 

Yes 

Yes 

No 

Yes 

Yes 

2 

Yes 

No" 

Yes 

No 

No 6 

No" 

3 

No 

No 

No 

No 

No 

Yes 


"Zero-scores for nominal IMS training data applicable only for current 
working version of IMS, may change in future versions 
fc No composite score available for best performing variant 
"Does not process multivariate time series datasets 


would result in a single, composite numerical score. For these 
reasons, it will not be considered as a viable candidate. Table 
III summarizes the algorithms discussed thus far and which 
requirements are met by each. It is therefore clear that by using 
Orca or SVM in tandem with LDS, all of our requirements 
are fully covered. 

With the proposed architecture in mind, we return to our 
primary focus, which is to assess model fidelity for both 
branches of the algorithmic setup shown in Fig. 2. In order to 
make a valid argument for the practical use of any anomaly 
detection algorithms that result from the application of the 
architecture in Fig. 2, we will need to address the issue of 


The state of the system, x/ ;: £ R” evolves according to 
these equations, and often characterizes some internal physical 
characteristic of the system, beginning at time k — 0, with 
value x 0 via state matrix A. The scalar output of the system 
is given by y k £ R, and evolves through output matrix C. 
Both input noise (w/,), which influences the state trajectory, 
and measurement noise, ( v k ) which influences the output 
are introduced in order to allow for a more realistic model. 
The noise is modelled stochastically via a standard Gaussian 
distribution with means and covariances specified above. /y x 
is the mean of the state trajectory, and Po is the initial 
state covariance. Therefore the parameters to be learned are 
specified below, as the parameter 0. These parameters are 
also shown in Fig. 5, which specify them in relation to the 
probabilistic graphical modeling paradigm which may be used 
for machine learning purposes. 

0=(/x x ,P o ,A,C,Q ,R) (3) 

Let | ■ | represent the number of elements that comprise a 
parameter of 9, for example 



Fig. 5. Linear Dynamic System 


1/4x1 

= n 

|A| 

= TV 

|C| 

= n 

IQI 

= TV 

|Po| 

= 7 V 

\R\ 

= 1 


Therefore, a formula for the total number of parameters to 
be learned is shown in Eqn. 4. 


parameterize more elaborate control system architectures that 
include PI controllers, use state feedback, or use even more 
sophisticated techniques from control theory. Furthermore, 
enforcing these constraints during learning implicitly reduces 
the dimension of the parameter space. 

We can estimate the natural frequency by making an as- 
sumption of e(t) to be represented by a zero-mean stationary 
Gaussian random process. In this case, we can use Rice’s 
formula for the level-upcrossing rate [15] [27], as shown in 
Eqn. 10, to compute the natural frequency, ui n = — . This 
formula can be derived very easily [20], and is used in similar 
studies [8] [9] [21], 


|$| — l/fxl + |Po| + 1 -A- 1 + | C | + |Q| + \R\ — 3 n" + 2n + 1 (4) 

The notation for the equations is shown as in [17] for 
simplicity and generality. However, if we consider the special 
case of control system error as the output, y £, the notation 
becomes trickier to bookkeep, as was performed in [18]. 
Furthermore, the details on discretization of the continuous- 
time LDS and initialization using basic assumptions are also as 
presented in [18]. A brief summary is provided here, as these 
details relate to one of various initialization/learning strategies 
we will investigate. 

The state dynamics of an open-loop plant, V, as shown in 
Fig. 1 can be expressed by equations 5-6 1 

x(f) = A c x(i) + B c u(i) + T c w(t) (5) 

y{t) = C c x(f) + v{t ) (6) 


A c 

r c 

c c 


o 1 

0 

"n 

[1 0 ] 

^L e -K ^) 2 

27rcr e 


(7) 

( 8 ) 
(9) 

(10) 


The number of zero-upcrossings, i/+, of L = 0 by the 
sample data, and the 2"' i - order statistics: //,,, and <j e can all 
easily be obtained in order to find ui n by using Rice’s formula. 
During the learning procedure for the linear dynamic system, 
the EM algorithm is used to find the parameters shown in 
Fig. 5. Details of this procedure are provided in Ghahramani 
and Hinton [11] as well as Digalakis et. al. [7], and it is 
implemented using Murphy’s BNT (Bayes’ Net Toolbox) [24]. 


where 

w{t) ~ AT(0,Q c ) 
v{t) ~ Af(0 ,Rc) 

The controllable canonical form shown in Eqns. 7-9 is used 
to allow for a mapping to intuitive canonical parameters: the 
natural frequency, u> n , and the damping ratio, (. Constraining 
ourselves to this basic canonical form is not only intuitively 
appealing, but it may also allow for us to ultimately appeal 
to the control research community. State-space models of 
this form offer the building blocks, however primitive, to 

'All coefficients are subscripted by c for “continuous” in order to disam- 
biguate between this and the unsubscripted discrete analogue shown in Eqns. 

1 - 2 . 


A. Initialization and Learning Strategies 

One method of initialization/learning is to set £ = 1 and 
“clamp” uj n during training. Initial values for A c and T c can 
then be derived as a function of £ and to n . C c = [ 1 0 ] 

is also fixed during learning, and R is initialized by making 
a random guess at the SNR (signal to noise ratio), so that 

JD °~e 

_ SNR' 


A C X SS + X ss A^T 

Q c 


r r T 

L C 1 - c 

o\ - Rc 
C c X ss C c T 


( 11 ) 

( 12 ) 


Using these assumptions, we apply the steady-state 
continuous-time Lyapunov equation (w/ solution X ss ) in order 


to find an adequate initialization for Q c , as is performed in 
[20], [21], and shown in Eqns. 11-12. We then discretize all 
parameters using the sampling interval T s using the procedure 
outlined in [21], allowing us to form Eqns. 1-2 (details 
omitted for clarity). Furthermore, we use the solution of 
the discrete algebraic Lyapunov equation (Eqn. 13) as an 
initialization for Pq, and initialize // x = 0, After learning, 
we can then back out the learned value of the damping ratio 
C and the signal to noise ratio. 


Pn = P„ = AP^A 7 


Q 


(13) 


Another initialization/learning technique involves the same 
initialization strategy, but relaxation of the “clamping” of ui n 
during training. In this way the value of oj n can also be learned 
in addition to the damping ratio £, and the signal to noise 
ratio. However, we still enforce the canonical form constraint 
throughout the learning process. Finally, we will investigate 
a constraint-free learning process, although the initialization 
strategy remains identical to the two previous cases, and 
enforcement of the canonical form constraint is applied after 
the learning process. 

For comparison, two alternative initialization strategies will 
be tested as well, one of them which is least desirable of all, 
that being a semi-random approach. Here we will randomly 
initialize A, C, and Q, using Eqn. 13 again to find Po- In 
order to ensure that Q is a Hermitian matrix, we apply the 
following transformation: . Finally, we make a random 

guess at the SNR (signal to noise ratio) in order to choose R, 
and initialize // x = 0, as before. 

Factor analysis is the final alternate initialization strategy 
to be tested. To use an analogy from within the probabilistic 
graphical modeling paradigm, factor analysis is to LDS (or 
Kalman filter) as the GMM is to the HMM (Hidden Markov 
Model) [14], That is, the main assumption shared by the 
GMM and factor analysis is the lack of causal conditional 
dependencies among the hidden variables. In the GMM/HMM 
domain the hidden variables are discrete/multimonial random 
variables, and in the factor analysis/linear dynamic system 
domain the hidden variables all have continuous Gaussian 
distributions. As such, the same machinery for learning the pa- 
rameters (the EM algorithm) is used for factor analysis, which 
provides us with an informed set of initialization parameters, 
and has been performed in relevant machine learning venues 
[12]. Table IV summarizes and assigns case numbers to all 
initialization and learning strategies discussed thus far and to 
be investigated in the subsequent section. 

We will also quantify the levels of suboptimality introduced 
by using these various initialization schemes in the subsequent 
section. We enforce the constraints during learning, or after 
learning so that the model structure will adhere to the canon- 
ical form. A more technically sound approach to initialization 
may be to apply a random perturbation technique of the kind 
often seen in algorithms such as stochastic local search [23]. 
A more technically sound approach to learning would be to 
derive a modified M-step so that the parametric updates are an 


explicit function of desired parameters (the intuitive canonical 
parameters such as SNR, oj n , and £, and potentially even 
control gains). 

The only other research that addresses a similar control 
theoretic approach to using probabilistic graphical modeling 
is Deventer et al. [6] and Deventer [5], whose work will 
also be considered for comparison in future studies. However, 
in Deventer’s work, controller design is part of the research 
problem. We will assume that the controller has already been 
designed and attempt to learn its dynamical structure and that 
of the plant. All of these methods will be investigated in future 
studies. 

Furthermore, all of the initialization and learning strategies 
discussed thus far consider only the control system error as the 
primary data source. If we alternatively use the transformation 
of the Orca or SVM score as the data source, we may use a 
context free initialization strategy such as factor analysis, so 
as not to cast the data in a particular domain (i.e. intuitive 
canonical parameters for control system error). We may also 
lift all restrictions during training, and have more flexibility 
with the order of the model. As such, we will consider model 
orders ranging from n = 2 to n = 10 for the Orca and 
SVM scores, in addition to models trained by using the control 
system error as a data source. 

B. Model Likelihood and Kalman Filter Equations 

As of yet, we have not discussed the metric with which 
we will ascertain model fidelity. As a precursor, let us discuss 
the log-likelihood function of our model which is maximized 
during each iteration of the M-step. The likelihood of the data 
given our model parameters can be expressed as follows: 

T 

p(yo,...,y T \0) = []A((£ fe ;0,CP fc | fc _ 1 C T + f?)(14) 

k = 0 

Uk Vk\k — 1 (15) 

where T is the total number of observed samples, and £/, in 
Eqn. 15 is the white noise innovation process. Other definitions 
are provided below. 


Xfc|fc = E[x k \yo,...,yk] 

Pfc|fc = ^[(xfe-x fe | fc )(x fe -x fc | fc ) T |t/ 0 ,...,y fc ] 


Furthermore, 

Vk\k 


(16) 

X-k-\-l\k 

= Axfcjfc 

(17) 

Ffc+l|fc 

= P fc+t|fcC T (CP fc+1 | fc C T + f?) _1 

(18) 

Pfc+l|fc 

= AP fc|/eA T + Q 

(19) 

P fc+l|fc+l 

P/c+l|/c Pfc+l|fc^P/c+l|fc 

(20) 


Eqn. 18 represents the dynamically updated Kalman gain, 
and combining the two equations 19 and 20, we may obtain 
the following: 



TABLE IV 

Initialization and Learning Strategies for Control System Error 


Case Label 

Training Constraints 

Parameter Clamping 

Initialization Type 

Case #1 

Canonical 

Natural Frequency 

Data-Driven, Canonical 

Case #2 

Canonical 

None 

Data-Driven, Canonical 

Case #3 

None 

None 

Data-Driven, Canonical 

Case #4 

None 

None 

Semi-Random 

Case #5 

None 

None 

Factor Analysis 


Pfc+i|fc — AP/tifc^A 7 — AF fc |^._ 1 CPfc|/ s _iA T +Q (21) 

Thus far, all equations have been introduced under the 
assumption that y k is zero mean process, without loss of 
generality. This was allowed for the sake of mathematical 
convenience. We must make a distinction between the control 
system error and the SVM or Orca composite anomaly scores 
providing the basis for the training dataset. The control system 
error is close enough to a zero mean process qualitatively 
and quantitatively to allow for the mathematical representation 
introduced thus far to be used (cf. Fig. 3). However, when 
using the SVM or Orca composite anomaly scores as the 
basis for the dataset, the zero-mean assumption clearly fails, 
as evidenced in Fig. 4. As such, we will briefly highlight how 
to use the current mathematical formulation without loss of 
generality. There are a number of ways to handle data with a 
non-zero mean, however in our case we will add a term to the 
state and output equations as shown in Eqns. 22- 23. 

x fc+ i = Ax fc + B u k + w fc (22) 

Vk = C x fc + v k (23) 

where 



ttfc — V/c tE 1, . . . , T 

Fy 

C(I„ - A) _1 B 

We assume u k G ffi. is a scalar whose steady-state value 
can easily be determined by the use of Eqn. 22, and B is 
chosen to be a fixed coefficient out of convenience. This is 
practical because B is not updated during the learning process, 
and as such does not require clamping which might otherwise 
introduce suboptimality. is empirically determined from the 
training data set, and is used for validation as well. Propagating 
this extra term through the Kalman filter will result in the 
updates shown in Eqn. 24 

Xfc+i|fc = Ax fc | fc + B u k (24) 

Alternatively, we could have introduced an additive constant 
to Eqn. 23 in order to account for the non-zero mean. However, 
the formulation shown above allows us to appeal to supervised 
learning problems or dynamic systems which may actually 
require the use of an driving input that changes with time, as 
is the case in control theory. 


C. Akaike Information Criterion 

An expression for the log-likelihood function follows easily 
from [10], shown in Eqn. 26. The more general expression is 
proved in [16], and the expression shown in Eqn. 26 is derived 
by using the assumptions and notation introduced thus far, in 
addition to Eqn. 25. 

= CPfe^C 7 + R (25) 


T ^ 

logp(t/o, • • • ,y T \0) = -- log 2 tt - log Ofc + (£fcCT fc ) 2 

k = 0 

(26) 

However, it is well known that even though this is implicitly 
the objective function of the MLE (Maximum Likelihood Es- 
timation) problem performed iteratively during the M-step of 
the EM learning algorithm 2 , we cannot use this as an unbiased 
indicator for the assessment of model fidelity [30]. The form 
of the log-likelihood function itself would otherwise seem to 
indicate some measure of how well the data ( y 0 , . . . ,yx) fit 
the model ( 9 ). The “log” part of the log-likelihood function is 
introduced for mathematical convenience and has no bearing 
on the result of the MLE problem since it is a monotonic 
operator. 

The bias stems from the fact that the learned model para- 
meters that comprise 0 are used to obtain the value for the 
log-likelihood. Intuitively, any metric used to assess how well 
the data fit the model should not contain any parameters used 
to train that model. The bias introduced can also be derived 
with more mathematical rigor, using a more precise definition. 
However, we will forgo those derivations here and speak about 
bias in a more qualitative sense. 

This bias can be accounted for by using the Akaike infor- 
mation criterion (AIC), which is based upon the Kullback- 
Leibler (KL) divergence. The KL divergence is a measure of 
the distance between the modeled distribution and the true 
distribution. Therefore, the AIC is often used to guide model 
selection due to its inherent capability to assess the fit of the 
data to the model based upon the number of model parameters. 
As such, it can be use for the assessment of model fidelity. 

The biased term shown in Eqn. 27 acts as a proxy for the 
precise definition of KL divergence. The AIC adjusts for the 

2 More precisely, the expected complete log-likelihood function is the 
objective function under consideration for MLE. in place of the log-likelihood 
function. This involves the use of sufficient statistics represented as the 
expected value of the hidden variables P /, | /,. and x fe k computed during the 
E-step. 




inherent bias by adding a bias correction term as shown in 
Eqn. 28, which is an approximation of the expected difference 
between the KL divergence and the computed bias term. The 
AIC defined in Eqn. 29 is the sum of the biased term and the 
bias correction term, where 9q represents the parameters of 
the true model. 

%(yo,---,VT,0) = — 2\ogp{y 0 , . . . ,y T \0) (27) 

T c (0o) « m (28) 

AIC = T b (y o ,...,y T ,O) + T c (0 o ) (29) 

We will explore the use of a method recently introduced by 
[2] which is a variant of the AIC explicitly derived for state- 
space models, denoted AlCi. The AlCi criteria is a revised and 
improved approximation to the bias correction term, which 
is robust for small sample settings. It also tends to report 
less deviation from the true model when the trained model is 
overparameterized. Because the bias correction term represents 
the expected difference between the KL (Kullback-Leibler) 
divergence and the computed biased term, we can estimate it 
via Monte Carlo simulation. We can therefore replace Eqn. 28 
with Eqn. 30 which uses an ensemble average of this expected 
difference by using a convenient sampling distribution. 

The sampling distribution, yk ~ Af( 0, 1) is used to generate 
M distinct T-sample training cases, and M independently 
sampled additional T-sample test cases. Essentially, there are 
2 M sets of data generated. The first M sets of data are used 
as training data to generate corresponding sets of new model 
parameters, and the second M sets of data are used as test 
cases. The constant coefficient, C (j), shown indexed by j 
corresponds to the j th model generated by the j th set of 
training data. P fc | k (j) and x k \ k _i(j) correspond to the j th 
set of test data using the j th model. 

m / T 

T c (0o) « - T +mH E ^e (P^ fc (i)) + ... (30) 
i= i Vfc=o 

J2^l\k-iU)c T U)Pk\lU)cU)^k\ k -iU ) J 

fc = 0 / 

Because the AIC and AlCi criteria both provide a measure 
of disparity between the true and fitted model, we would like 
to minimize either metric to as low a value as possible. In 
the next section we will evaluate both the quantitative and 
computational differences between using both methods and 
use the method that demonstrates the best performance with 
respect to these requirements. As such, we will also provide 
an interpretation of the results for the chosen metric using all 
initialization and learning strategies discussed thus far, and for 
both training and validation sets shown in Table II. 

IV. Results 

We begin the presentation of our results for the case in 
which we train on control system error. It is used both for 
training and validation based upon the data shown in Table II. 


TABLE V 
AIC Results 


Case Label 

AIC II 

Initial 

Final 

Adjusted 

Validation 

Case #1 

27432 

27349 

27349 

57805 

Case #2 

1046211 

27084 

27084 

57743 

Case #3 

27432 

26954 

27341 

57444 

Case #4 

43421 

26957 

30522 

57301 

Case #5 

27440 

26954 

28478 

57198 


The model order in this case is restricted to n = 2 in order to 
allow for adherence to canonical form. 

Thorough testing was performed on the quantitative and 
computational differences between the AIC and AlCi metrics. 
It was found that the resulting AlCi metric values were on par 
with the AIC values, but at an added computational burden 
of 1615:1 when using M = 10 as the number of training 
and test cases to use for the ensemble average shown in 
Eqn. 30. Case # 1 was used for testing purposes, and on 
average the AlCi values were fractions of a percent different 
from the values listed on the first row in Table V for AIC. 
This may inherently be due to the modest dimension of the 
parameter space (n = 2 => \6\ = 17), and due to the 
fact that the approximation for the AIC criterion is sufficient 
due to the large sample size (T = 6507). Furthermore, the 
same canonical parameters resulted when using both methods, 
corresponding to the first row of Table VI. As such, unless 
otherwise stated, we will apply the AIC metric for our measure 
of model fidelity. 

In Table V, we see that the lowest AIC values are the 
learned models (in the “final” column) for Cases #3 and #5. 
This corresponds to unconstrained learning, using data-driven 
canonical initialization and factor analysis-based initialization, 
respectively. Intuitively, this result makes sense due to the 
freedom from constraints, and the use informative initialization 
schemes. The final AIC score for Case #4 in which the less 
informative semi-random initialization was used does not yield 
a much different result than Cases #3 and #5. However, it does 
appear that the AIC score for Case #4 prior to training was 
well above the other cases, with the notable exception of Case 
# 2 . 

Case #2 requires adherence to the canonical form during 
training, however neither of the canonical parameters are 
clamped, as opposed to Case #1 in which only u> n was required 
to be clamped. As such, the same initialization strategy used 
for Case #1 failed due to non-convergence of the learning 
process. Therefore an alternate set of parameters as shown 
in the corresponding row in Table VI was used. Table VI 
also illustrates that the damping ratio and natural frequency 
values are often pathological (i.e. Cases #4 and #5). However, 
realizations of the resulting linear dynamic system suffice to 
qualitatively represent the training data for the control system 
error. At any rate, a coarse grid search was used to find 
the parameters corresponding to Case #2, which yielded a 
successfully convergent learning regime. 

The initial AIC score shown in Table V for Case #3 was 




TABLE VI 

Canonical Parameter Results 


Case Label 

Natural Frequency ( u > n ) 

Damping Ratio (0 

Signal-to-Noise Ratio (SNR) II 

Initial 

Final 

Initial 

Final 

Initial 

Final 

Case #1 

34.06 

34.06 

1 

0.614 

100 

72.2 

Case #2 

34.06 

94.79 

0.001 

1.48 

10000 

128.34 

Case #3 

34.06 

37.38 

1 

2.04 

100 

59.2 

Case #4 

34.06 

0.005 

1 

21769 

100 

57.78 

Case #5 

34.06 

0.073 

1 

216 

1.56 

42.02 


hence understandably very high due to the extreme parametric 
settings required for convergence. However, the final AIC 
score obtained after the learning process were on the same 
order of magnitude as for the remaining cases. The final 
AIC scores were higher for Cases #1 and #2 that imposed 
constraints during learning, and understandably highest for 
Case #1, in which the natural frequency parameter, u) n , was 
clamped. 

Table V also lists the AIC scores for canonical adjustments 
that occur after training, in the fourth column labelled “Ad- 
justed.” For Cases #1 and #2, in which the constraints were 
applied throughout the learning process, the AIC score clearly 
does not change. However, for the remaining Cases #3 - #5, a 
modest level of suboptimality is introduced, as evidenced by 
the higher AIC scores. Intuitively, even though no constraints 
were placed on the form of the parameters during training 
in Case #3, an adjustment to enforce the canonical constraint 
introduces less suboptimality than the other Cases # 4 and #5 
since they were not initialized in canonical form. 

Therefore, it is clear that if a strong requirement for the 
restriction to canonical form exists, the best method to use 
is Case #2. This case actually enforces the canonical form 
during training, but does not clamp any of the parameters. 
However, if there is no strong requirement for adherence to 
the canonical form, the cases that do not enforce constraints 
during training, specifically Case # 3 and #5, which do not 
use random initialization demonstrate superior performance. 

The final column in Table V show the AIC score for the 
validation data as applied to either the final or “adjusted” 
model, whichever has the lowest AIC score. As might be 
expected, the validation scores are much higher than the 
learned AIC scores due to the fact that more than half of 
the validation data contain anomalies (cf. Table I). However, 
it is evident that the AIC score decreases as the case number 
increases. 

Case #1 is the most restricted, in which both clamping 
and constraints are enforced during learning, leading to the 
highest reported validation score. Case #2 does not clamp 
parameters during learning, although the canonical constraint 
is enforced, resulting in a slightly lower reported AIC score 
for the validation data. The remaining cases are constraint-free 
during the learning process, with Case #5 demonstrating the 
most favorable AIC score for the validation data. As such, 
we will use factor analysis as the default initialization method 
to study the serial architecture. This involves using the SVM 
and Orca composite anomaly scores as training and validation 


data, in addition to control system error for comparison. The 
results will be presented graphically as a function of model 
order. 

We conclude this section with a discussion of the results 
for the cases related to the use of the serial architecture. 
One of the most ubiquitous observations in this part of the 
investigation is that the learning phase often fails, due to matrix 
singularities that arise when using model orders above various 
thresholds for different techniques. This occurs even when 
using factor analysis for initialization, which is a far superior 
method than randomization. Using the “data-driven canonical” 
approach to initialization is valid only for a model order of 
n = 2 for the framework chosen in this paper. However, it is 
certainly possible to augment the state-space in canonical form 
to allow for a more generic parametric representation that is 
less intuitively appealing. Alternatively, we may even explore 
the use of a linearized physics-based model that is defined 
as a function of physical parameters for the prior distribution 
(i.e. initialization). Hence, we can implicitly integrate physics- 
based and data-driven methods in a Bayesian context, while 
augmenting our model order in an informative manner. 

Using control system error and SVM anomaly scores for the 
training data, the learning phase failed to converge for model 
orders of n = 4 or higher, and for model orders of n = 3 
or higher when using Orca anomaly scores for the training 
data. This may imply overparametrization, or a fundamental 
identification of the maximum specified model order, which is 
not known apriori. However, in order to provide a sanity check 
on these results, we may use the AlCi criterion in addition to 
the AIC criterion. This is done in order to account for the fact 
that the AIC criterion often yields a biased lower score for 
models that are overparameterized due to higher model orders. 
However, the model learning part of the procedure outlined in 
association with Eqn. 30 failed. This was due again to matrix 
singularities that arose when using higher model orders. For 
the lowest model orders that did converge, the AlCi criteria did 
serve to corroborate that the AIC criteria was fairly unbiased 
and accurate. Again, on average the AlCi values were fractions 
of a percent different from the AIC values. 

Even though certain models cannot be successfully trained 
for all model orders, we may still investigate their fidelity. Both 
training and validation data can be used to generate AIC scores 
that are shown as function of model order. The model based 
upon the initialization using factor analysis can easily be used 
to compute the AIC for both training and validation data. The 
results are shown in Figs. 6-8, for all three cases illustrated 





Fig. 6. AIC Result for Control System Error 



Fig. 7. AIC Result for SVM 


in the serial architecture. Where possible, the AIC score is 
also shown for models with lower orders that successfully 
completed the training phase. 

As seen in Fig. 6, the validation AIC score is nearly double 
that of the training AIC score for all model orders shown, as 
might be expected. Also, there is a slight rise in both training 
and validation AIC scores with model order. This may be 



Fig. 8. AIC Result for Orca 


harder to discern for the training AIC score due to an outlier at 
n = 7, however, there is a general upward trend which meets 
with intuition. Furthermore, as previously alluded to, the AIC 
score is shown for the models with lower orders ( n = 2, 3) 
that successfully completed the training phase. 

In comparing these results to those shown in Fig. 7, we first 
draw the reader’s attention to the discrepancies between the 
scales of both Fig. 6 and 7. In general, the AIC scores are much 
more extreme for the SVM-based AIC scores shown in Fig. 
7 than in the control system error-based AIC scores shown in 
Fig. 6. The AIC score for the training data in Fig. 7 rises from a 
very small, even negative AIC score, to one that is much higher 
than in Fig. 6 for higher models order. This exaggerated rise 
in AIC with model order speaks to the idiosyncracies of fitting 
an overparameterized LDS model to the SVM score. A similar 
characteristic is evident for the validation data, however, for 
lower model orders the AIC score is a much higher positive 
value than was found for the training data. In fact, for n = 2, 
the AIC scores based on the validation SVM data was on 
par with the validation control system error data, indicating a 
similar fit. In this case it may be more prudent to choose the 
SVM model using the serial architecture due to the limitation 
of using control system error cited earlier. 

The results of using Orca in the serial architecture are shown 
in Fig. 8. Due to the drastic difference between the scales of 
the AIC scores for training and validation data, the data labels 
are segregated and shown opposite of each other (training in 
the left in blue, and validation on the right in green). It is clear 
that the AIC scores exhibited here are orders of magnitude 
greater than for either of the previous cases shown in Figs. 
6-7. Neither exhibits the expected monotonic increase in AIC 
with model order. All of these observations are consistent with 
the fact that the model is not fit well to the data. As such, we 
can certainly look more at this issue in future work using the 
strategies outlined previously that are shared with the SVM 
approach. 

The lowest validation AIC score out of all models tested 
was for the SVM composite anomaly score, when using a 
model order of n = 3, having a value of 13473. As such, this 
should certainly be a candidate for application to subsequent 
anomaly detection studies. This should be preceded by a more 
thorough examination of the subtleties for any potential bias 
of the reported AIC score by using AlCi for corroboration if 
possible. Furthermore, remedies to the matrix singularity issue 
need to be investigated in more depth, as well as other issues 
introduced earlier. One such example is to exploit the fact that 
the SVM score is a distance-based metric in order to achieve 
a better qualitative fit of the data to the model. 

V. Conclusions 

In this paper, we have examined the model fidelity of 
several competing data transformation techniques as measured 
by the AIC (Akaike information criterion). When using control 
system error as the sole source of data, we have found that 
the use of canonical constraints during training decreases 
model fidelity. However, if the canonical constraint is deemed 


compulsory, no clamping should be used in order to achieve 
the lowest AIC score. Furthermore, when using the canonical 
constraint in future studies, suboptimality may be avoided 
by deriving a modified M-step during learning. Lifting the 
canonical requirement implies constraint free learning, and the 
use of either data-driven initialization or initialization based 
upon factor analysis yields the best model fidelity. 

We have also found that the SVM composite anomaly 
score yields the lowest validation AIC score, as is such 
a candidate for future study for application to subsequent 
anomaly detection studies. Part of this study should include 
the investigation of matrix singularities that appear during 
learning when initialized by factor analysis. This may be 
remedied by using more principled initialization techniques 
such as stochastic local search. 

Finally, by using the serial architecture in lieu of the control 
system error alone, we are implicitly reducing the entire 
feature space into a univariate signal while retaining salient 
operational signatures. This is potentially a far more effective 
approach than using only a small fraction of the feature space 
by using the control system error alone. As such we may 
potentially allow for many more anomalies to be detected by 
using this paradigm. 
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